Self Training Wrapper Induction with Linked Data
Abstract
This work explores the use of Linked Data for Web-scale Information Extraction, with a focus on the task of Wrapper Induction. We show how to use Linked Data effectively to automatically generate training material and build a self-trained Wrapper Induction method. Experiments on a publicly available dataset demonstrate that, for covered domains, our method can achieve an F-measure of 0.85, which is competitive with a supervised solution.
Similar resources
View Validation: A Case Study for Wrapper Induction and Text Classification
Wrapper induction algorithms, which use labeled examples to learn extraction rules, are a crucial component of information agents that integrate semi-structured information sources. Multi-view wrapper induction algorithms reduce the amount of training data by exploiting several types of rules (i.e., views), each of which is sufficient to extract the relevant data. All multi-view algorithms re...
Early Steps Towards Web Scale Information Extraction with LODIE
Extracting information from a gigantic data source such as the web has been considered a major research challenge, and over the years many different approaches (Etzioni et al. 2004; Banko et al. 2007; Carlson et al. 2010; Freedman and Ramshaw 2011; Nakashole, Theobald, and Weikum 2011) have been proposed. Nevertheless, the current state of the art has mainly addressed tasks for w...
Automatic Wrappers for Large Scale Web Extraction
We present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform informa...
XPath-Wrapper Induction by generating tree traversal patterns
We introduce a wrapper induction algorithm for extracting information from tree-structured documents like HTML or XML. It derives XPath-compatible extraction rules from a set of annotated example documents. The approach builds a minimally generalized tree traversal pattern, and augments it with conditions. Another variant selects a subset of conditions so that (a) the pattern is consistent with...
Boosted Wrapper Induction
Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures (“wrappers”) for highly structured text such as Web pages produced by CGI scripts. For suitably regular domains, existing wrapper induction a...